Inducing Romanization Systems

Authors

  • Keiko Taguchi
  • Andrew Finch
  • Seiichi Yamamoto
  • Eiichiro Sumita
Abstract

We propose a method for inducing romanization systems directly from a bilingual alignment at the grapheme level. First, transliteration word pairs are aligned using a non-parametric Bayesian approach; then, for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. We apply our approach to the task of transliteration mining, using Levenshtein distance as the selection criterion. We performed experiments on three languages with differing characteristics: Japanese, Russian and Chinese. Our experiments show that the mining system built from the induced romanization system is able to outperform existing baseline romanization systems. By extending our approach to induce romanization systems based on other criteria, we expect that our technique may find more general application in the future.
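The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the non-parametric Bayesian aligner is not reproduced, and its output is assumed to be available as counts of aligned (source grapheme sequence, target grapheme sequence) pairs. The function name `induce_romanization`, the count-weighted scoring rule, and the toy counts are all hypothetical choices made for illustration; the abstract only states that Levenshtein distance is used as the selection criterion.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def induce_romanization(aligned_counts):
    """Pick one romanization per source grapheme sequence.

    aligned_counts maps (source, target) pairs to alignment counts
    (assumed to come from a separate grapheme-level aligner).
    Assumption: the chosen candidate is the aligned target that minimizes
    the count-weighted total Levenshtein distance to all aligned targets.
    """
    by_source = defaultdict(dict)
    for (src, tgt), n in aligned_counts.items():
        by_source[src][tgt] = by_source[src].get(tgt, 0) + n

    table = {}
    for src, targets in by_source.items():
        def weighted_cost(candidate):
            return sum(n * levenshtein(candidate, t) for t, n in targets.items())
        # sorted() makes tie-breaking deterministic
        table[src] = min(sorted(targets), key=weighted_cost)
    return table


# Hypothetical toy counts, invented for illustration only.
toy_counts = {
    ("シ", "shi"): 5, ("シ", "si"): 2, ("シ", "she"): 1,
    ("ン", "n"): 6, ("ン", "m"): 1,
}
table = induce_romanization(toy_counts)
```

Swapping `weighted_cost` for a different scoring function corresponds to the "user-specified criterion" mentioned in the abstract; the rest of the sketch would stay the same.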

Similar resources

A Novel Method to Evaluate Romanization Systems: The Case of Romanizing Arabic Proper Nouns

The transliteration of Arabic proper nouns into other languages is usually based on the phonetic translation of these nouns into their phonetic Latin counterparts. Most dictionaries do not include most of these nouns, although some may have meanings. Transliteration is essential to the field of Natural Language Processing (NLP) in general, and specifically to machine translation systems, cross-lang...

A Unified Model of Thai Romanization and Word Segmentation

Thai romanization is the way of writing the Thai language using the Roman alphabet. It can be performed on the basis of orthographic form (transliteration), pronunciation (transcription), or both. As a result, many systems of romanization are in use. The Royal Institute has established the standard by proposing the principle of romanization on the basis of transcription. To ensure the standard, a ful...

Factored Machine Translation Systems for Russian-English

We describe the LIA machine translation systems for the Russian-English and English-Russian translation tasks. Various factored translation systems were built using MOSES to take into account the morphological complexity of Russian and we experimented with the romanization of untranslated Russian words.

Development and Testing of Transcription Software for a Southern Min Spoken Corpus

The usual challenges of transcribing spoken language are compounded for Southern Min (Taiwanese) because it lacks a generally accepted orthography. This study reports the development and testing of software tools for assisting such transcription. Three tools are compared, each representing a different type of interface with our corpus-based Southern Min lexicon (Tsay, 2007): our original Chines...

Automatic Romanization for Thai

There is a common need to romanize words in languages other than English for global communication. In particular, the romanization of proper names is unavoidable. Since there is no mutual standard, writing a Thai word in English letters is not trivial, and it is quite a labor-intensive task if it cannot be computerized. In this paper, we propose a new romanization system aiming at initi...

Publication date: 2013